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Abstract 

We investigate the behavior of a population genetics model introduced by 
( Waxman fc Peck, 1998| ) incorporating mutation, selection, and pleiotropy. The 



population is infinite and continuous variation of genotype is allowed. Nonethe- 
less, Waxman and Peck showed that if the degree of pleiotropy is large enough, 
in this model a nonzero fraction of the population can have identical alleles. 
This 'condensed mode' behavior appears in the limit of infinite times. 

This paper explores the time-dependence of the distribution of alleles in this 
model. First, the model is analyzed using a recursion technique which enables 
the distribution of alleles to be calculated at finite times as well as in Waxman 
and Peck's infinite-time limit. Second, both Waxman and Peck's original model 
and a related model in which mutations occur continuously are mapped onto 
problems in quantum mechanics. In both cases, the long-time analysis for the 
biological model is equivalent to finding the nature of the eigenstates of the 
quantum problem. The condensed mode appears if and only if there is no 
bound state in the quantum problem. 

We compare the behavior of the discrete- and continuous-time versions of 
the model. The results for the two cases are qualitatively similar, though there 
are some quantitative differences. We also discuss our attempts to correlate the 
statistics of DNA sequence variations with the degree of pleiotropy of various 
genes. 



*The James Franck Institute, The University of Chicago, 5640 S. Ellis Avenue, Chicago, IL 60637 
t Corresponding author. Email address: snc@control.uchicago.edu; Fax number: (773) 702-5863. 
*The Hospital for Special Surgery, 535 East 70th Street, New York, NY 10021 



1 



1 Introduction 



Recently, ( |Waxman fc Peck, 1998| ) introduced a simple population genetics model 
of an infinite population with a continuous distributions of alleles^ incorporating 
pleiotropy (one gene affecting several characters of an organism) . They demonstrated 
that when the number of characters affected is greater than two, the long-time steady 
state solution of their model can have a nonzero fraction of the population with 
identical alleles, and that this phenomenon does not occur if the number of characters 
affected per gene is two or fewer. 

In this paper we consider their model and a simple variant of it, addressing several 
issues. The first three issues involve the mathematical analysis of the theory. First 
we study the time evolution of the distribution of alleles. In the model, the initial 
distribution of alleles is continuous, and at infinite time there is an infinitely narrow 
(5-function) peak;[] we ask how the distribution of alleles depends on time at long 
but finite times. Second, we describe how to construct time-dependent solutions by 
using an expansion in a sum of functions (Gaussians) particularly picked to meet 
the requirements of this problem. Third, we examine the relationship between the 
model with discrete generations used by Waxman and Peck (where mutations occur 
at discrete times, and fitness selection occurs continuously) and a continuous-time 
model in which both mutations and fitness selection occur continuously. The latter is 
a limiting case of the discrete model arising when the time between generations gets 
very short. We demonstrate that the qualitative behavior of the two models is the 
same, but that dependence of the behavior on the parameters can be different in the 
two models. 

We show that the population genetics models can be mapped onto problems in 
quantum mechanics, specifically Bose-Einstein condensationP|( |dc Groot et ai, 1950| ) 
and motion of a particle in a central potential.^ The mapping onto Bose condensation 
is performed on the discrete-time model in the limit that the fitness selection is 
very strong so that only organisms with fitnesses very near optimum survive each 



1 Waxman and Peck's model is based on the continuu m-of-allclcs m od el of (Kimura, 1965). Other 
important inve stigati ons of plciotr opic models include ( Wagner, 1989 ), ( Landc, 1980 ), ( Gavrilcts fc 
|dc Jong, 1993|) , and ( [Turelli, 1985| ). 

2 The Dirac delta-function, S(t), is defined to be a function very sharply peaked around t = 0, but 
having a total weight J dt8(t) equal to one. One definition of such a function is the limit as z goes 

to positive infinity of (2irz) 1//2 exp[— t 2 /(2z)]. We shall deal with variables x with N components. 
For such variables, S(x) is defined to be a delta function in the first component of x, times one in 
the second component, .... In this way, one has J d N x S(x) = l.(Spanier & Oldham, 1987) 

3 Bose-Einstein condensation in three dimensions is discussed in many texts on statistical me- 
chanics, e.g., (Feynman, 1972). 

4 See any book on quantum mechanics, e.g., ( Schiff, 1968| ). 
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generation. The mapping onto the quantum particle in a central potential applies 
more generally. The simplest case has the potential independent of time. This case 
corresponds to the a limit of the population model in which fitness selection and 
mutations both occur continuously. The close relation between the continuous-time 
population biology models and Schrodinger's equation has been exploited by Burgerf] 
to obtain many general results on the long-time behavior of models of the type studied 
here. In this paper, we focus on the mapping for a specific model, which yields a 
simple physical interpretation of the emergence of a unique genotype, and allows 
the extraction of the time-dependence of the fitness peak as well as comparison to 
the discrete-time model. We find that the behavior is qualitatively identical when 
the mutations are continuous and when they are discrete, though there are some 
quantitative differences. 

We also discuss our attempts to relate the results from this model to the statistical 
properties of the sequences that are archived in various sequence databases. We 
analyze the statistical properties of gene sequences tabulated in genetic databases and 
attempt to correlate observed variations with estimates of the degree of pleiotropy of 
different genes in the databases. The results of these attempts are inconclusive. 

This paper is organized as follows. In section |2] we introduce the theoretical 
models. Section [3] presents our analysis of the time evolution of the discrete version 
of the model, demonstrating that the behavior can be extracted analytically in a 
limit in which the typical jump caused by mutations is very large. In section |] we 
present our analysis of the continuous-time model. Section [5] contains remarks on the 
relationship between the continuous- and discrete-time models. Section ^ presents our 
examination of DNA sequence archives and our attempts to correlate the statistics of 
documented sequence variations to the degree of pleiotropy of various genes. Section 
recaps our main conclusions. The Appendix contains mathematical details of the 
analysis of the time-dependence of the continuous-time model aimed at establishing 
the long-time time-dependence. 



2 The Models 



One model we study is that of ( [Waxman fc Peck, 1998|) . In this model, the population 



is infinite, the phenotypic variation is continuous, each gene affects N characters, and 
the effects of different genes are uncorrelated (no linkage disequilibrium or epistasis). 
The model assumes a very large population of haploid and asexual organisms with 
discrete generations. Let x be a continuous vector with N components which rep- 
resents characters which determine the viability of an organism, and <p(x, t) be the 



3 A recent paper with references to previous work is ([Burger fc Bomze, 1996) 
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normalized probability density that an organism in the population has the characters 
given by x at time t. The normalization condition for the probability is given by an 
iV-dimensional integral as 



(x,t)d x 



1 



At each generation an organism with characters x survives viability selection with a 
probability proportional to exp(— x 2 /2V s ). This Gaussian fitness selection will play 
an essential role in our solution of the model. Once each generation, a fraction 9 of 
the population mutates; it is assumed that if a mutation occurs, then the probability 
that the mutant takes on the value x given the parental value y is f(x — y). 

Since t) is a probability density, it is normalized at every time t. Thus, for 
this model, the equation for the time evolution of 4>(x, t) is: 



r(t+i)(f){x,t + i) 



9)w 1 (x)(j){x,t) + e / f{x-y) Wl {y)(t>{y,t)d N y 



where the fitness factor w\{x) is 

Wi(x) = exp[— x 2 /2V s ] . 



(2) 



(3) 



The multiplicative factor, T(t + 1), in Eq. (|2]) ensures that the probability, <p(x, t + 1) 
is properly normalized at every time step. Integrating Eq. (|2|) over all x gives the 
Waxman and Peck result for this normalization: 



r(t + 1) = J (f>(x,t)w 1 (x)d N x . 



(4) 



We follow Waxman and Peck in using a Gaussian function for the mutation proba- 
bility: 

f(x -y) = (2irm 2 )- N/2 exp[-(f - y) 2 /2m 2 }, (5) 

where m 2 describes the variance of the mutant effects for a single character. 

In addition to this discrete-time model, we will consider a model in which both 
mutations and selection occur continuously in time. The relation between the discrete 
and continuous-time models can be seen explicitly by noting that the differential 
equation 

dcj)(x, t) 



Ot 



x 



a(t) 

V ; 2V 



<f)(x, t) 



has the solution 



[x, t + t) 



exp 



t+T 



X 



ds a(s) r 

t V ' 2V 



<t>{x,t). 



(6) 



(7) 
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Thus, model Eq. (|2|) can be written as a differential equation: 

d<p(x, t) 



dt 



a{t)- 



x 



2V 



(j)(x, t) 

+ 6(t - nr) f dyf{x-y) [<f>(y, t) - <f>(x , t)} 



where r is the interval between mutations. Here, a(t) is determined by the nor- 
malization requirement of Eq. ([I]) and 5(t) is the Dirac delta function. In dynam- 
ical systems theory, equations with a sum of time-delta functions are said to be 
'kicked' QLichtenberg fc Liberman, 1992|) . When the interval between kicks (here r) 
goes to zero, the sum of delta-function is replaced by its time average, so that Eq. (|8|) 
reduces to: 

d(j)(x, t) 



dt 



a(t) 



x 
2V 



</>(£, t) + G dy f(x - y) [<t>(y, t) - <f)(x, t)] 



(9) 



where is the rate of mutations per unit time, 6/t, and a(t) is determined by the 
normalization requirement, Eq. (jl[) 

In the next few sections we calculate the evolution of t) for different N, with 
an emphasis on the behavior near x = at long times. In section |3] we examine the 
discrete-time model Eq. (0); section |] discusses the continuous time equation Eq. (|9|). 
In section [| we discuss the relationship between the results for the continuous and 
discrete-time equations. 



3 Time Evolution of Discrete Equations 

In this section we consider the discrete-time evolution Eq. We will solve it by 
exploiting the following observation. Suppose at some time t 4>(x, t) is a normalized 
Gaussian with variance a, so that <f>(x, t) = G a (x), wheref] 

„2 ' 



G a (x) = (27ia)- N/2 



cxp 



X' 

2a 



(10) 



Eq. (|D then implies that (p(x,t + 1) is the sum of two Gaussians. Indeed, at any 
finite time t, 4>{x, t) is the sum of finitely many Gaussians. 

To see how this works, notice that Eq. (|2]) involves the processes of multiplication 
and of convolution with a Gaussian. If one multiplies two Gaussians, one gets another 
Gaussian according to the rule: 

G a (x)G {x) = (-^— ) N/2 G,(x), where l/ 7 = 1/a + 1/(3 . (11) 
zTiap 



'Strictly speaking, a is the variance of one component of x. 
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Convolution has slightly different rules. A textbook integration yields the simple 
result: 

J G a (x — y)G p{y)d N y = G$(x) , where 5 = a + (3 . (12) 

Instead of doing integrals at each time step, we can write recursion relations for 
the evolution of the widths and amplitudes of the Gaussians. We then solve these 
equations in the limit V s <C m 2 . Our method of solution has similarities with that 
used by ( |Kingman, 1978| ) on a simpler model. 

To perform the calculation explicitly, first consider a warm-up problem — the case 
of no mutations (9 = 0), which is described by the equation: 

T(t + l)(f)(x, t + 1) = exp[-x 2 /2V s ](f)(x, t) , (13) 

where T(t+1) is defined in Eq. (f|). We assume that, at time t, <f)(x, t) is a normalized 
Gaussian with variance a(t): 

<p(x,t) = G a(t) (x) (14) 

The multiplication process of Eq. (|Tl"| ) implies that the value of <p(x,t + 1) is, once 
again a Gaussian of the form G a r t +i)(x), but that the new value of a is given by 
l/a(t + 1) = l/a(t) + 1/V S . This equation for a(t) has the solution: 

Q(i) = tt^Tk • (15) 

where = a(t = 0). Thus the population distribution remains Gaussian, and at 
long times, its width scales as \x\ ~ t -1 / 2 . As expected, fitness selection continually 
narrows the distribution. 

Now we examine the full Eq. (pD with 9 > 0. We first calculate (f)(x,t = 1), given 
that (f)(x, t = 0) is a Gaussian. There are two terms in the equation. Each involves 
the multiplication of Gaussians, and the mutation term also involves a convolution. 
By the rules of Eqs. (|TT|) and (0), <p(x,t = 1) is the sum of two Gaussians of the 
form: 

4>(x, 1) = (1 - 9)G Ma) (x) + 9G m (x), (16) 
where the two /3's are given by: 

1 + a/Vs 

(3 2 (a) =m 2 + (3 1 (a) , (18) 

It follows immediately that if <fi(x,t) is the sum of finitely many Gaussians, then 
4>(x, t + 1) is also the sum of finitely many Gaussians. Explicitly, we write 4>(x,t) in 
the form: 

<j ) (x,t) = J2Mt)G adt) (x) , (19) 
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with the normalization condition that J2i Ai — 1- Then <p(x, t + 1) can be written: 



r(t + i)0(f,t + i) = {\-e)Y J BmG^ am {x) + eY J BmG^ a m^) - (20) 

i i 

with: 

B, l (t) = A(t)[l + a l (t)/V s ]- N/2 , (21) 
T(t + l)=J2B i (t) . (22) 

i 

Eqs. (|20| - p2"D simplify considerably when V s /m 2 <C 1. In this case, the character- 
istic width of the mutation probability function, /, is much greater than the width 
produced by one step of the fitness narrowing. In this big-mutation-jump limit, 

f3i(m 2 ) 7H V s , him 2 ) « m 2 , (3i(V s /n) = V s /(n + 1), feiVa/n) « m 2 .[] (Here, n is any 
positive integer.) 

Consider the time evolution of an initial distribution consisting of a Gaussian with 
variance a = m 2 . After one time step, the distribution splits into two Gaussians, an 
unmutated population with variance (3 ~ and a mutated population with variance 
(3 ~ m 2 . After two time steps, the population consists of three Gaussians with (3 
values ~ m 2 , V s , and V s /2. After n steps, the population consists of n + 1 Gaussians 
with (3 values m 2 , V s , V s /2, V s /n. We define «o = m2 , ®i = V s /i for i > 1. (Note 
that these a's are time- independent.) If we start with an initial distribution which is 
a Gaussian of width much greater than m, then after n steps, the distribution consists 
of n + 1 Gaussians with (3 values a , a n . We describe the time evolution using 
rate equations for the populations in these states. Fig. 1 is a sketch of the transitions 
that can occur between the different a's. Note that at a given time, a state «j makes 
transitions to two states, a and a i+ i. 

We write 4>(x, t) as: 

( j>(x,t)=J2Mt)Ga i (Z) , (23) 

i=0 

where the Ai(t) must obey:. 

A (t >0)=d (24) 
A 1 (t>0) = ^^QA (t-l) , (25) 



lK ! T(t) 



11 N/2 

A-iit-l) (i > 1, t > 0) , (26) 



Because the newly-mutated population is a Gaussian of width determined by m 2 for any dis- 
tribution, this approximation is equivalent to the "house-of-cards" approximation introduced by 
( Kingman, 1978| ) and investigated in (Turclli, 1984), (Barton & Turclli, 198E). One can show that 



the results derived in this section are robust when this approximation is not made, so long as one is 
the regime where V s <C m 2 . 
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and 



T(t + l) = QA (t) + J2Mt)(—r) ■ (27) 

We have denned Q = (1 + m 2 /V^) _7V//2 ; note that Q 1 in the limit we consider. 

First consider the first two steps of the evolution. Starting with the initial condi- 
tion A (t = 0) = 1, Ai(t = 0) = 0, one finds: 

A (t = 1) = 9 , 
A 1 (t = l) = l-6. (28) 

So far not much has happened. But a key thing happens at the next step: 



A 1 (t = 2) 



A (t = 2) = 9, 
9Q{1- 



6Q + (1- 9)2~ N I 2 



/ 1 _ a\2n-N/2 

Note that A x {t = 2) oc Q < 1. In fact, for any t > 2, all Aj's with < z < t are 
proportional to Q. However, as t — > oo, the number of these terms diverges. So it 
is not obvious whether as £ — > cxd a solution exists where A)(t) = ^> ^t(^) = ^(1), 
and all other Aj(i)'s are small, which would imply that the Gaussian describing the 
unmutated population contains a nonzero fraction of the total population. We will 
find that such a solution can exist only when N > 2. 

First we find the steady-state solutions in the long-time limit t — > oo. In steady 
state, the time arguments on the Ai and on Y can be dropped, and the recursion 
relations become: 

A = 9 (30) 
A 1 = 9Qv , (31) 
- V 



and 



N/2 

A-i (* > 1) , (32) 



1-9 1 f i \ N ' 2 

= #Q+hm£A t — -) , (33) 

where we have defined v = (1 — ^)/r. We explicitly allow for the possibility that the 
amplitude = limt^ooAtit) is nonzero. 
The solution to Eqs. ( j30| - |33"l) is: 



A = 9, 

Ai = 9Qi~ N/2 v i , % > 1 , (34) 
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where v must satisfy: 



i n oo 

^ = A OQ + 6QY,v l 



i=0 



1 \ N/2 
i + l) 



(35) 



Note that v < 1, for otherwise the Aj cannot sum to unity, which is required for 
normalization of the probability.^ Since as v increases the left hand side of Eq. (|35|) 
monotonically decreases and the right hand side monotonically increases, a solution 

with Aoo = can exist only if 9Q J2ilo ( I+T ) > 1 — 9- Conversely, if this inequality 
is not satisfied, then we expect > 0. Since A^ ~ Hm^^v 1 , we require v — 1 
if Aoo is nonzero. We show below that such a solution is the long-time limit of the 
solution of the time-dependent equations. 

First consider the case when Aoo ^ 0. Since the sum in Eq. fl3"5p converges if 
N > 2, and diverges otherwise, we see that Aoo ^ can occur only if N > 2. When 
the sum does converge, the unmutated fitness peak, whose amplitude is Aoq and 
whose width is given by Eq. (0), can contain a nonzero fraction of the total weight. 
Evaluating Eq. (0) with v = 1, we find A^ = 1 - 9 - 9Q((N/2), where ((N/2) is 
the Riemann zeta function( |Carrier et al, 1983| ; [Abramowitz fc Stegun, 1972| ). 

Now we consider the possibility of a solution in which the unmutated population 
constitutes an infinitesimal fraction of the total population as t — > oo. Since A^ = 0, 
we must have: 

*(N/2,v), (36) 



9Q 



where v < 1 and $(a, x) is the polylogarithm function (|Lewin, 1981|) : 



oo ^ 

$(a,a;) = 



(37) 



i=l 



When a > 1, $(a,x = 1) = £(a) is finite, where again ((z) is the Riemann zeta 
function. Therefore, for N > 2, the the right hand side of Eq. (|36D is bounded as 
v —> 1, and a solution exists only if (1 - 9)/(9Q) < ((N/2). In the Q < 1 limit that 
we have assumed, this happens only when 1 — 9 is very small also. If this condition 
is not satisfied, then a nonzero fraction of the population must be in the unmutated 
state. 

When N = 2, $(1, v) = - log(l - v), and in the limit 9Q/(1 - 9) < 1, one finds 



1 



exp 



1 



9Q 



(3? 



8 This inequality is derived in ( Waxman fc Peck, 1998[), a nd is closely related to results derived 
by Burger and collaborators (Burger, 1986), (Burger, 1988), (Burger & Hofbauer, 1994), (Burger 
|fc Bomze, 199^ ) 
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so that: 

r « (1 - 0) U + exp J (iV = 2). (39) 

When N — 1, the leading order term as i> — > 1 can be calculated by noting that, for 
all v < 1, $(|,i>) obeys the bounds:^ 

dx —=v x < $(-, d) < / dx —=v x . (40) 
l 2 Jo vx 



The integrals both converge at their lower limits. They also converge at their upper 
limits when v < 1, but not when t> = 1. The behavior near the upper limit can be 
used to estimate that, as v approaches 1, 



1 f°° 1 

< ^(o> t ') ~ / dx —= exp[— x\ In v\] 

2 jo \J x 



7T 



2J|lnf| 

« 7t/(2v / T^) . (41) 

Thus, v « 1 - [(tt0Q)/(2(1 - #))] 2 , and: 

r«(i-*) + I (.v = i). (i2) 




Now we calculate the function <p(x,t — ► oo).Q First we consider the case when 
has a 5-function piece. 

In this regime, as t — > oo, A = 0, A, = 9Q(l/i) N ^ 2 (i > 1), and Aoo, the weight 
in the 5— function, is Aoo = 1 — 9 — 6Q((N/2), where ((N/2) is the Riemann zeta 

9 These bounds follow because the integrand is monotonically decreasing. For any positive integer 
7 and v < 1, one has f J+1 dx -k=v x < -K=v^ < P. , dx -k=v x . 

10 The form of the distribution calculated by Waxman and Peck (their footnote 29) does not apply 
here because they require simultaneous validity of the inequalities 9V s /m 2 <C 1, m 2 /V s <C 1, and 
N -C Vs/m 2 . Our calculation applies whenever V s /m 2 -C 1. 
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function. Therefore, 



(x, t — > oo) 



[1-6- 6Q£(N/2))5(x) + 6(2nm 2 )- N/2 exp 

oo 

+6Q Y,(l/n) N/2 (2nV s /ny N / 2 exp 

71=1 

[1-9- 6Q((N/2))6(x) + 6{27cm 2 )- N/2 exp 



x 



2m 2 

2 



n.r 



2V S 

x 2 
2m 2 



+9Q(2tiV s )- n/2 exp 



x 

2K 



-i 



1 



(43) 



Thus we see that when <fr(x,t — > oo) has a 5-function piece, there is in addition a 
divergent contribution at small x, proportional to 1/x 2 . When the distribution is 
smooth, we have A = 6 and A,- t = 6Q{l/i) N l 2 v l (i > 1), and v < 1, and we obtain: 



{x,t ^oo) = 6(2Tim 2 )- N/2 



exp 



+6Q{2W s y N l 2 



v 1 exp 



x 

2K 




(44) 



The sums that arise here are identical to those that come up in the calculation of 
Bose-Einstein condensation for an ideal Bose gas flde Groot et al, 1950]) . The weight 
in the 5— function in the genotype distribution, A^, is analogous to the condensate 
fraction in the Bose condensation problem. The parameter N, which here describes 
the number of traits affected by a mutation, is the number of spatial dimensions in 
the Bose gas calculation. That the superfluid fraction must be zero for a Bose gas 
in one and two dimensions is a special case of a general result in the theory of phase 
transitions( |Mermin fc Wagner, 1966| ), (|Hohenberg, 1967| ), ( [Frolich fc Pfister, 1981| ). 

We now discuss the time evolution of the population distribution. We begin with 
some qualitative remarks. When v < 1, the sum in the solution Eq. (^) converges 
geometrically. Therefore, in this regime one expects the approach to the t — » oo 
limit to be exponential. In the regime where a 5— function contribution is present as 
t — > oo, the S— function is the long-time limit of a Gaussian describing the unmutated 
population. The squared width of this Gaussian narrows as 1/t, so we expect the 
approach to the long-time limit to be power-law. These expectations are supported 
by the explicit calculation that we now present. We, in fact, show that the long-time 
corrections to the amplitude of the 5— function are also proportional to 1/t. 

We again write recursion relations describing the transitions between the various 
Gaussians, now allowing for explicit time dependence in the Ai(t). Defining v(t) = 
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(1 — 9)/Y{t + 1), these recursion relations are, for t > 0: 



Ao(t) = & » 
A^t) = Qv{t - l)A {t - 1) 



(45) 
(46) 
(47) 

For the initial conditions A (t = 0) = 1 and A i>0 (t = 0) = 0, these recursion relations 
have the solution: 



Ai(t) = «(t - 1) 
1 - 6 



Ai-\{t - 1) 

N/2 



t-1 



t'=t-i 

The self-consistency condition is: 
1-9 Q 



n\ N / 2 



(49) 



t+i / 1 \ Ar / 2 



E 



A)(*-j) n «(o 



(50) 



Now we separate out explicitly the term j = t + 1; this is reasonable because w(t' = 0) 
is larger than all the other v 's by a factor proportional to 1/Q, yielding: 



1 ^ = ^Ef 1 ) JV/2 n 



1 y AT/2 t-1 

N n «co 

f=l 



t + 1 



(51) 



Comparison of this result to that of the steady-state analysis (Eq. (|34])) reveals that 
as t — > oo the last term on the rhs is just Aqo, the amplitude of the 5— function. 
Finally, defining j(t) = I\t'=i v ('t')y we find 



'1 - 



AT/2 



7(«) 



( l x^/ 2 



(52) 



Recall that at long times approaches a limit t> < 1. Therefore, for large t, 
j(t) = CjV*, where C 7 is a constant, up to corrections that vanish as t — > oo, and 



AT/2 



W J + (1 



1 \ )V / 2 



t + 1 



ay 



(53) 



When f is strictly less than unity, the convergence is exponential, as expected from 
the qualitative argument given above and in agreement with the result from the 
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continuous-time equations below. When Hindoo v(t) = 1, then, since the second 
term on the right hand side is nonzero as t — > oo, we must have 7(i) = 11*= 1 v(t) = 
C s (t+l) N / 2 , where C 5 is a constant. This implies v(t) -> ((t + l)/*)^/ 2 ~ 1 + 0(l/t). 
By calculating the correction to the sum because of this variation in v(t), we find that 
the amplitude of the 5— function obeys A t (t) — Aao oc t" 1 as t — > oo. When N = 2, 
v(t — > oo) is exponentially close to unity, so that the decay of the amplitude in the 
unmutated peak up to extremely long times is governed by the logarithmic divergence 
of the sum for v = 1. Thus, in this case we expect A t (t) to decay logarithmically with 
t. 

In figure 2 we plot A t (t), the fraction of the population which is unmutated, versus 
time t for the parameter values 9 = 0.2 and Q = 0.1, for N = 1, 2, and 3. The curves 
were obtained by numerical iteration of the recursion relations Eqs. fl45H48), starting 
with the initial condition A (t = 0) = 1 and Ai(t — 0) =0 for i ^ 0. We find, as 
expected, that A t {t) decays to zero for N = 1 and N = 2 but not for N = 3. The 
scales on the plot were chosen to emphasize the logarithmic decay when N = 2. 

To compute the probability function t) we first calculate the amplitudes Ai(t), 
using the recursion relation, and then sum the corresponding Gaussians (see Eqs. (|19|) 
and (|T0|)) to obtain numerical values of the probability. These results are plotted in 
figures [| [I], and || for the parameter- values 9 = 0.2 and Q = 0.1. For N = 1, see 
figure 0, we plot t) against the magnitude of x for various values of the time, 
t. The picture shows a probability distribution which first gets narrower, but then 
settles down to a time-independent behavior for the largest times. Also drawn on this 
plot is the infinite-time probability, derived from Eqs. ([34]) and (^). The probability 
4>(x,t) settles down to this limiting behavior most slowly near x = 0, where fitnesses 
change slowly with x. 

For iV = 3, a corresponding calculation shows a probability function which gets 
more and more peaked as time goes on, see figure f|. Notice how, for t = 1000, 
the peak sticks up sharply from a more slowly varying background. To see how this 
peaking occurs, we plot in figure § a scaled version of figure |j. In this version, we plot 
4>s(x s , t) where s (x, t) is the result of dividing <p(x, t) by the predicted growth of the 
peak, proportional to t N / 2 , giving us a scaled y-variable (p s (x,t) = (fi(x,t)t~ N / 2 . The 
scaled x- variable is x s = xt 1 / 2 . The resulting plot becomes time- independent for large 
times and not-too-large values of x s . In this way, we see how the peak continually 
gets narrower, and more and more dominates the small-x behavior. 

Now we turn to the question of the role of initial conditions in the behavior of 
discrete-time model. This question is important because the method of solution of 
the discrete-time model outlined in this section assumes that the initial distribution 
4>(x, t — 0) is Gaussian. Here we present our arguments why the results of the 
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calculation should not depend sensitively on this choice of initial condition. 

First, note that any initial distribution <fr(x, t = 0) which has a width much greater 
than both V s and m? will narrow to two Gaussians of widths V s and m 2 after one 
mutation step. Therefore, the behavior of any broad distribution will be indistin- 
guishable from that of a Gaussian of width ^ m 2 . 

Now we consider more general variation of the initial conditions by examining the 
time-dependent Eq. (51). Every term in the sum arises from a mutation event at 
some time t > 0; changing the initial conditions changes only the last term on the 
right hand side. In the t —>■ oo limit, the the amplitude of the 5— function is obtained 
by subtracting from unity the fraction of the population which has mutated at some 
time. The long-time limit, v, does not depend on the initial condition; to see this, 
note that either v — 1, in which case Aoo is obtained by the subtraction just described, 
or else v < 1, in which case = and the amplitudes of all the terms that depend 
on the initial conditions vanish exponentially. Therefore, the initial conditions do not 
affect either the presence or absence of the 5— function or its steady-state amplitude 
when it is present. 



4 Continuous-Time Equations 

In this section we consider the continuous-time evolution described by Eq. (|9|). In 
(|Biirger fc Bomze, 1996|) it is shown that there is a unique time-independent solution 



as t — > oo and the possible solutions are classified using general properties of linear 
operators QReed fc Simon, 1972| ). Our results for the long-time limit are consistent 



with theirs. In addition, our approach enables us to interpret simply the emergence (or 
not) of the S— function peak as well to address the time-dependence of the approach 
to the t — > oo limit. 

We proceed by Fourier transforming Eq. (Jf^). We define the quantity n(k,t) = 
J d N x (p(x, t) exp[ik ■ x] and obtain: 



dn(k, t) 



n(k,t)-Q[l-f(k)]n{k,t) , (54) 



dt 

where f(k) is the Fourier transform of Eq. @: 

f(k) = exp[-m 2 k 2 /2] . (55) 

We must also specify, in addition to the initial conditions, two boundary con- 
ditions. One boundary condition is determined by the normalization requirement 
/ d N x 4>(x,t) = 1, which yields: 

n(£ = 0,*) = l. (56) 
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The second boundary condition is that n(k, t) < 1 for all k, which in particular implies 
that it cannot diverge as k — > oo. This follows from combining the normalization 
condition with the non-negativity requirement 4>{x, t) > 0: 



n(k,t) = J d N x e^ x (j)(x,t) < J d N x 0(f,£) = 1 . (57) 

We now solve this equation in the limit of long times. Since as t —>■ oo both n and 
a become independent of time, we must solve: 



-Vl-(QV)f(k)^(k) = Em, (58) 

where i/j(k) = lim^oo n(k, t) and the eigenvalue E = V (lim^oo a(t) — 0). Eq. (|5£|) 
is just the time-independent Schrodinger equation flSchiff, 1968| ). The parameter N, 
the number of characters affected by each mutation, here is interpreted as the spatial 
dimensionality. As shown in ( [Burger fc Bomze, 1996 , and references therein) and 



as discussed here in the Appendix, the long-time solution to the population biology 
model is given by the lowest energy eigenstate of this Schrodinger equation. 

The kinetic energy term — ~V| in Eq. ( p8|) comes from the fitness selection. This 
term, which in if-space causes the distribution to become narrower and narrower, 
acts to cause it to become wider and wider in /c-space. The potential energy term 
—QVf(k) comes from the mutations. This potential is constant at large k, and has 
an attractive piece at small k. If we assume that the potential has a characteristic 
scale in k, so that f(k) = F(mk) (clearly the fitness function considered here is of this 
form), then we can define z = mk and write the Schrodinger equation in dimensionless 
form: 

-^V|- (^)F(z)}^) = mz\ (59) 

where E = E/m 2 . This equation makes it clear that the nature of the behavior 
depends only on the single parameter (QV/m 2 ). 

If only the selection (kinetic energy) term were present, the solutions to Eq. (p5|) 
consistent with the normalization condition <p{k) = 1 are 

ip(z) = exp[ip ■ z\ , (60) 



where p = zv 2E, and z is a unit vector along z. To be consistent with the boundary 
conditions, we require E > 0. The ground state eigenstate thus has E = 0, so that 
i[>(z) = 1. Fourier transforming this result, we see that in the absence of mutations, at 
t = oo the entire population has the same genotype. This result is consistent with our 
calculation in the previous section demonstrating that in the absence of mutations 
the variance of <j){x,t) narrows in proportion to 1/ time. 
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So now we consider the effects of adding the potential arising from the mutation 
term. This potential is attractive, so that the question we must address is whether 
the ground state eigenstate remains extended (ip(z) — > constant as \k\ — > oo) or 
whether it results in a bound state with E < 0, which has ip(z) —>■ exponentially 
in k as k — > oo. In the latter case, the real-space distribution <p(x) is smooth as 
x — > 0. Therefore, if there is a bound state, then only an infinitesimal fraction of the 
population is unmutated, whereas if there are no bound states, then a finite fraction 
of the population has a unique genotype. 

It has been proven for the Schrodinger equation that any attractive potential, no 
matter how small, will lead to a bound state for N < 2 ( Simon, 1976j ). Thus, the 



population distribution in real space <j)(x,t — > oo) must be smooth for N <2. When 
N > 2 bound states only appear if the potential is large enough (QV s /m 2 > Cn, where 
Cn is of order unity) ( |?chift', 1968 ). Therefore, in this case, if the potential is small 



(weak mutation effects), then the ground state remains extended: ip(k) — > constant as 
k — > oo, and the real space distribution <f>{x) has a 5— function piece. If the potential 
is large (strong mutation effects), then there is a bound state, and the distribution 
4>(x) is not singular at small x. 

If all the states are extended, then as \k\ — ► oo the lowest energy state obeys 
V|n(&) = and has the form n(k) — > A + Bk~^ N ~ 2 \ where A and B are coefficients 
determined by matching to the solution in the small \k\ region.^ 

For almost all potentials both A and B are nonzero; Fourier transforming the 
term proportional to B yields: 

B term oc J dn J™ dk k N - l e ikxcos ( e ) fc~( N ~ 2 ) K x -z . ( 61 ) 

(Here, / dfl is the integral over angles.) Thus, just as in our discrete-time solution, 
we find associated with a 5-function at x = a power-law divergence at small x, 
4>{x) oc x~ 2 . 

In the regime where the ground state is bound, there is a finite energy gap between 
the ground state and the excited states, which implies that the the long-time limit is 
approached exponentially in time. When there is no bound state, the energy spectrum 
of the Schrodinger equation is continuous, which leads to power-law convergence to 
the long-time limit. These points are discussed in greater detail in the Appendix. 



n This form can be verified for all N by several means, the most straightforward being direct 
evaluation using Cartesian coordinates. A more elegant method is to note that we are looking for a 
function g which satisfies = V 2 g = div-grad g. Since grad g is a current j that is angle-independent 
and obeys divj = 0, then J j dS r — 0, where S is the surface of a sphere of radius r. Therefore, one 
must have j oc r^^ 1 ), so that g oc r~( N ~ 2 \ 
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5 Relation between the continuous-time and the 
discrete-time models 



In this section we discuss the similarities and differences between the results of our 
discrete-time analysis valid when V s /m 2 <C 1 and those of the continuous-time anal- 
ysis. The basis of our discussion is Eq. (H), which can be used to describe both 
cases. 

Qualitatively, the discrete and continuous models exhibit very similar behavior. 
They both lead to only smooth distributions (a "normal" phase) for N < 2, and 
for N > 2 display a transition as the mutation rate is decreased between the normal 
phase and a "condensed" phase, where a finite fraction of the population has a unique 
genotype. In both models, the convergence to the long-time limit is power law in 
the condensed phase and is exponential in the normal phase. However, there are 
quantitative differences between the two models, particularly the detailed dependence 
on model parameters. 

In the continuous-time model, the form of Eq. ( (59| ) makes it clear that the behavior 
depends only on the single parameter QV/m 2 . In contrast, the discrete model that 
we solve in the limit m 2 /V s ^> 1 (Eqs. (|24| - |26|) ) depends on two parameters: Q = 
(1 + m 2 /V S )~ N / 2 and 9. For N > 2, the discrete model might or might not have 
a condensate, depending on the value of the ratio 9Q/(1 — 6). In this regime, the 
continuous-time model is always in the condensed phase. 

We use Eq. (||) to discuss the relation between the continuous- and discrete-time 
models. One can Fourier transform this equation (as we did in section f| for the 
continuous-time model) and interpret the result as a quantum-mechanical problem 
of a particle in a potential which is applied only at the discrete times t = nr as in 
Eq. 

The discrete model has two parameters while the continuous model has only one, 
since Eq. (H) depends not just on the mean mutation rate G but also on r, the time 
between mutation events. Thus, in principle the discrete-time model has one more 
parameter than the continuous-time one. If r is increased by a factor of two, then the 
product 9V (8 the mutation probability per generation and V the viability selection 
parameter) remains constant, but V itself is halved. The analysis in section 0, which 
is valid when V/m 2 is very small, and hence in the limit when r is very long, yields 
a solution in which that both 9 and V/m 2 enter explicitly and not just as a product. 
Nonetheless, two models have very similar outcomes. 



12 We take the limit where a very strong potential is applied for a very short time at each t n = nr. 
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6 Comparison with sequence database statistics 



Here we present our investigations of the statistical properties of DNA sequences 
archived in genetic databases. The aim is to see if observed sequence variability of 
a gene is correlated with its degree of pleiotropy. The focus in this section is on a 
qualitative prediction of the model, that genes with a high degree of pleiotropy should 
have a narrower distribution of alleles than those that affect only one trait. This trend, 
which is consistent with the mathematical results of the previous sections, is easy to 
understand: if a gene of high fitness is pleiotropic, then because each mutation affects 
several characters independently, the chances are high that a given mutation leads to 
a large fitness decrease. 

Our first test for possible correlation between degree of pleiotropy and the proba- 
bility distribution describing genetic variation is simply to count the number of natu- 
rally occurring variants of various genes in the Drosophila database FlyBase ( FlyBase" 



1 1998Q . We choose to include naturally occurring alleles, including spontaneous muta- 
tions arising in laboratory stocks, of the genes analyzed. Mutagen-induced variants 
are not included, since there is no evidence that they exist in natural populations. 
Table 1 shows data for nine genes: brown, cinnabar, ecdysone receptor, engrailed, 
fork head, hairless, notch, vestigial, and white. The number of naturally occur- 
ring variants is tabulated together with the length of the primary transcript and the 
approximate length of the genomic region encompassing the gene and its flanking reg- 
ulatory regions. Transcript and genomic lengths are taken from the full-format gene 
reports in FlyBase. For genes with multiple transcripts, we report the length of the 
longest described transcript. When the full gene report does not include the genomic 
length, this is estimated from the FlyBase molecular map of the gene. The number 
of variants per gene is normalized both by the transcript length and by the genomic 
length. In both cases, the values range over approximately three orders of magnitude. 
We estimate the degree of pleiotropy of each of the listed genes according to the num- 
ber of tissues/structures and developmental stages in which the genes are expressed. 
This ordering is admittedly arbitrary to some extent, but few would argue with the 
conclusion that fork head (encoding a transcription factor essential for development 
of the midgut) and engrailed (encoding a homeotic gene establishing segmentation) 
are more pleiotropic than the eye color mutants cinnabar, white, and brown. Inspec- 
tion of Table 1 reveals a clear inverse correlation between estimated pleiotropy and 
number of variants/transcribed length or variants/genomic length. This relationship 
is consistent with the prediction of Waxman and Peck's model. 

A weak point of this approach is that other factors besides degree of pleiotropy 
could lead to the observed differences in the number of variants. For instance, even 
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two genes which only affect single traits could have vastly different variabilities, if 
the traits have greatly different effects on overall fitness. For example, a gene that 
controls eye color may well have a much smaller overall effect on fitness than a gene 
that controls a crucial developmental function. In the population genetics model, 
differences of this type are reflected by different values of the fitness parameter V s . 
Within the model, differences in overall variability caused by changing V s can be 
distinguished from those caused by changing the degree of pleiotropy N because they 
yield very different functional forms for the distribution function <f)(x). However, 
counting mutations yields no information about the functional form of the distribution 
function and thus cannot be used to distinguish between the mechanisms. 

Our second strategy for relating degree of pleiotropy to statistics of archived DNA 
sequences aims to obtain information about the form of the probability distribution 
for various genes. It assumes that two genetic variants whose sequences are highly 
similar in sequence space code for genes that are close together in fitness space. We 
use the BLAST 2 sequence similarity search tool flAltschul et al, 1997|) to search 
the GENBANK and EMBL DNA sequence databases^ against the 8 gene sequences 
listed in Table 2. These databases contain sequences of genes from many organisms, 
though a preponderance of the sequences are from humans. Given a target sequence, 
BLAST 2 generates a list of matching sequences along with scores which measure 
the degree of similarity. Because the maximum possible score for a given sequence is 
larger for longer sequences, we avoided examining genes that were either very short or 
very long. As can be seen in table 2, the lengths of the sequences examined varied by 
less than a factor of ten. Only genes were used whose list of matches was truncated 
because the score fell below the cutoff value of 40; no genes were used whose match 
list was truncated because the number of matching sequence exceeded BLAST 2's 
maximum number of matches of 500. As shown in table 2, the number of matches 
obtained also varies by roughly an order of magnitude for the genes in the table, and 
is not obviously correlated with the length of the gene. 

High BLAST 2 scores correspond to sequences which match the search sequence 
the most closely. The scores range from roughly 10 4 (a sequence matching itself) to 
the default cutoff of 40. We assume that this score is inversely correlated with the 
evolutionary distance between the input gene and the retrieved sequences, so that 
a gene with a high fraction of matches with low scores has a broader probability 
distribution 4>(x) than a gene with more matches with high scores. Thus, if all the 
matches tend to be very good, the gene is considered to be less variable than one 
with many poor matches. We examine the statistics of the closeness of the matches, 
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These databases and the BLAST 2 search tool are maintained by the National Center for 
Biotechnology Information (http://www.ncbi.nlm.nih.gov). 
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in particular the fractions of the total number of matches that fall into given ranges 
of scores. 

Figure |5] is a plot of a histogram of the fraction of the matches in a given score range 
obtained versus the inverse of the score for several genes. We do observe significant 
variability between these genes; for example, cytochrome P-l-450, which codes for a 
general-purpose antioxidant expressed in many tissues and is thus plausibly highly 
pleiotropic, has relatively many very good matches compared to rhodopsin, which 
is quite specific, coding for a protein essential for vision. However, we cannot claim 
that there is an unambiguous correlation between degree of pleiotropy and these 
plots. For example, HOX Al has many poor matches, where it might be expected to 
be pleiotropic, since it is essential to many developmental functions. Thus, though 
interesting variations in the statistics of DNA sequences for different genes are found, 
we have not been able to demonstrate convincingly a correlation between degree of 
pleiotropy and these statistical variations. 

The behavior of HOX Al can be explained by a process of repeated duplication 
and divergence over the course of evolution. Many of the retrieved genes have assumed 
distinct but related developmental functions. The BLAST 2 analysis presented here 
does not account for a shift in score distributions arising from divergence among 
members of gene families and superfamilies. Another fundamental difficulty with our 
analysis is illustrated by the case of alpha-tubulin, which codes for a protein vital to 
the formation of microtubules. On the one hand, microtubules are expressed in many 
different tissues, but on the other hand, all its functions arise from similar structural 
properties. Thus, one could categorize alpha-tubulin as either highly pleiotropic (since 
it is expressed in many tissues) or as non-pleiotropic (since the function is similar 
everywhere where it is expressed). In short, though degree of pleiotropy is defined in 
the populations genetic model, it is not clear how to quantify it in terms of biological 
function. 

There are several additional difficulties and ambiguities inherent in comparisons 
between genome sequence data and population genetics models of the type consid- 
ered here. First, the database scores are based on the quality of sequence alignments, 
whereas the population genetics models define distances in terms of fitness, which is 
a phenotypic quantity. Fitness distance and evolutionary sequence distance may be 
quite different; for example, point mutations at different locations may range in effect 
from unobservable to lethal. However, even if there is not a single definite relation- 
ship between sequence distance and fitness distance, so long as these quantities are 
positively correlated with each other, meaningful results may be obtained. Variabil- 
ity in the relation between the distances will result in noisy data, but different genes 
are all subject to similar uncertainties. Therefore, it is plausible to hope that, if sig- 
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nificant differences between the sequence variability of two genes are observed, they 
reflect differences in the fitness distribution </>(#) of the population genetics models. 
Another limitation of our analysis is that we consider the fitnesses of alleles as fixed. 
This is not necessarily the case, as many gene products function as components of 
multi-protein structures or pathways( |Lewontin, 1974] ), ( pover fc Flavell, 1984] ). This 
dependency of the fitness value of a specific sequence variant on the remainder of 
the genotype is manifested explicitly by the existence of epistatic interactions (e.g., 
(|Fijneman et a/., 19961 ), (|van Wezel et al, 1996Q ). 



In addition, one may worry that the population genetics model we have used 
involves a continuously varying phenotype, whereas sequence variations are discrete. 
However, because genes are thousands of base pairs long, there is still a huge range 
of variability, and the use of a continuum model is therefore reasonable. 

Finally, as mentioned above, pleiotropy itself is defined in terms of phenotypic 
fitness, and it is not clear how to create an independent measure that enables one to 
assign objectively the degree of pleiotropy N to a given gene. This is a serious fun- 
damental difficulty that must be overcome if one is to make meaningful quantitative 
statistical analysis of the possible correlations between the sequence statistics and the 
degree of pleiotropy. 

7 Discussion 

In this paper we have analyzed some variants of a population biology model incorpo- 
rating selection, mutation, and pleiotropy. We have focussed on understanding the 
circumstances under which at long times a nonzero fraction of the population has a 
unique genotype, and on characterizing the time dependence of the population distri- 
bution in this regime. We have analyzed both the discrete-time model of ( |Waxman 
|fc Peck, 19981 ) as well as an associated continuous-time model. We find: 



1. In both the discrete and continuous-time models, a unique genotype can emerge 
only when N, the number of characters affected by each gene, is greater than 
two, a result in agreement with ([Waxman fc Peck, 1998] ) . For any N > 2, 
the infinite-time population distribution <fi(x,t — > oo) contains a 5— function 
contribution when the mutation rate is nonzero but small, but not when the 
mutation rate is large enough. 

2. In the regime where <f)(x,t — > oo) has a 5-function contribution, this 5-function 
is accompanied by a 1/x 2 singularity at small x. 

3. The 5- function peak emerges as the limit of a peak that continually becomes 
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higher and narrower. Thus, in this regime the convergence to the t — > oo limit 
is power-law. 

4. In the regime when <f>(x, t — > oo) is smooth, the convergence to this distribution 
is exponential in time. 

5. The continuous- and discrete-time models exhibit qualitatively identical behav- 
ior, but there are quantitative differences between them. 

Our analysis of the discrete-time equations relies on the use of Gaussian functions 
for both the fitness and mutation terms. However, our analysis of the continuous- 
time model assumes only that the potential in the quantum mechanics problem is 
attractive and short-ranged. Therefore, we expect our results to apply qualitatively 
for a large class of different mutation terms. 

Other mechanisms giving rise to genetic diversity such as antagonistic pleiotropy, 
genetic drift, and discreteness of alleles change qualitatively the nature of the equa- 
tions describing the system. It will be interesting to see whether the "condensation" 
phenomenon investigated here is robust when these effects are taken into account. 

In addition to our mathematical analysis, we also attempted to assess whether 
the degree of pleiotropy of genes resulted in systematic trends in the degree of varia- 
tion they exhibited in the mutations documented in online genetic databases. These 
attempts to correlate the degree of pleiotropy with the statistics of the sequence 
variations were inconclusive. 

Available database resources do not allow unbiased sampling of the sequence vari- 
ation present in a natural population. Such a database is likely to emerge as the 
Human Genome Project's present initiative to sample the extent of sequence varia- 
tion in a wide array of genes in the American population progresses. 
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Appendix: The relation between eigenstates of the 
Schrodinger equation and the long-time behavior of 
the continuous-time population genetics model 

In section [| above we use the result that the long-time behavior of Eq. (j54|) is given 
by the lowest energy eigenstate of the associated Schrodinger Eq. fl58l) . This result 
has been shown by Burger and collaborators (Purger fc Bomze, 1996| , and references 
therein); for completeness we derive it here and emphasize the implications for the 
approach to the long-time behavior in the population biology model. 

We assume that we know the solution of the time-independent problem, which is 
a set of eigenstates u n (x) which satisfy: 

- ~Vz 2 u n {z) + V{z)u n {z) = E n u n {z) . (62) 

The eigenstates u n (z) are complete (any function can be written as a linear combi- 
nation of them) and can be chosen to be orthonormal: 

in 



d z u* ni (z)u n2 (z) = 5 nun2 , (63) 

where the star denotes complex conjugate. This normalization is usual in quantum 
mechanics( [Schiff, 1968| ) . 

We need to examine the time-dependent Eq. fl54|); using scaled variables, it is 
written as: 



dt 

where 



^(z,t)+V(z)^(z,t) , (64) 



a(t) 



~v2 + v(*) 



V>0M)k=o. (65) 



We write the wave function at time to, to), as a linear combination of eigenstates: 

ip(z,t ) = ^2bi(to)ui(z) . (66) 

i 

The condition i[)(z = 0,to) = 1 is enforced by requiring J2i h(to)ui(z = 0) = 1. By 
substituting the form (^) for ip and using the orthonormality of the eigenstates, one 
finds equations for the time-evolution of the coefficients bf. 

- d ^ = b l (t)[E l -a(t)}. (67) 
Integrating this equation yields: 

h{t) = b t {t G )A{t)e- E ^ , (68) 
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where A(t) = exp J** ds a(s) . Thus the ratio of the weights of any two eigenstates 
i\ and %i is given by: 

7— 777 = 7 77T ex P ~ ^1 ~ ■ 69 

b i2 (t) b i2 {t ) 

At long times the exponential factor ensures that the state with the largest weight is 
the one with the lowest energy. 

If there is a nonzero energy gap between the lowest and second-lowest energy 
eigenstates (no 5-function in 4>(x)), then Eq. (|69"D implies that the approach to the 
long-time limit is exponential in time. If there is no bound state, then all the states 
are extended and the spectrum is continuous; the E^ take the form Ei = qf/2, where 
qi has the dimension of a wavevector in z space, QSchiff, 19681) and hence a distance 
in the original x space. Therefore, 

i>(z,t) = A(t)J2h(to)e-& /2 . (70) 



The relative weight of the eigenstate i remains large until g« ~ yl/i Thus, the 
5-function at infinite times emerges from a peak that is narrowing, having a width 
proportional to Jl/t. 
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gene 


PR 

J. It 


i v y 


T 




N V /L T 


N V /L G 


fork head 


1 


l 


4.0 


13 - 60 


2.5 x 10' 


-4 


(1.7- 


- 7.7) x 10" 


-5 


engrailed 


2 


19 


3.6 


200 


5.3 x 10" 


-4 




9.5 x 10- 


-5 


ecdysone receptor 


3 


13 


5.5 


35 


2.4 x 10" 


-3 




3.7 x 10- 


-4 


notch 


4 


70 


15 


80 - 150 


4.7 x 10- 


-3 


(4.7- 


-8.8) x 10- 


-4 


vestigial 


5 


55 


3.8 


46 


0.014 






1.2 x 10- 


-3 


hairless 


6 


17 


6.0 


8 


2.0 x 10" 


-3 




2.1 x 10- 


-3 


cinnabar 


7 


22 


2.5 


15 


8.8 x 10- 


-3 




1.5 x 10- 


-3 


white 


8 


247 


6.0 


48 - 200 


0.041 




(1.2- 


- 5.1) x 10- 


-3 


brown 


9 


61 


3.0 


140 


0.020 






4.4 x 10- 


-4 



Table 1: Table of number of naturally-occuring mutations for different Drosophila 
genes. PR is pleiotropy rank, estimated using the number of tissues/structures and 
developmental stages in which the genes are expressed. Ny is the number of naturally- 
occurring variants of the gene, Lt is the transcript length, and Lq is the genomic 
length. All lengths are given in kilobases. 



gene 


length 


number of matches 


cytochrome P-l-450 


2565 


81 


dystrophin 


13957 


216 


epidermal growth factor receptor 


2660 


57 


BOX Al (human) 


2595 


380 


huntingtin 


10348 


108 


iodothyronine deiodinase 


2222 


159 


rhodopsin 


6953 


208 


alpha-tubulin 


1596 


488 



Table 2: Genes examined using BLAST 2 sequence similarity search tool. Lengths of 
genes are given in units of the number of bases. 
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Figure 1: Sketch of transitions between a^s in the limit V s /m 2 -C 1. Each aij corre- 
sponds to a Gaussian in <p(x,t). 
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Figure 2: Plot of the fraction of the population which is unmutated, A t (t), versus 
time t for the discrete-time model with parameter values Q = 0.1, 9 — 0.2, obtained 
by numerical iteration of Eqs. ([I^-fll) starting with A (t = 0) = 1, Ai^{t = 0) = 0. 
For N = 1 and 2, A t (t) decays so that Aoo = lim^oo A t (t) = 0, while for N = 3, Aoo 
is nonzero. 
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Figure 3: Plot of probability function <ft(x,t) against the magnitude of x for N — 1. 
The numerical results for times 10, 30, 100, and 1000 are calculated by summing 
Gaussians with weights computed from the recursion relations. This set of curves is 
compared with the expected infinite-time result, calculated as a solution to Eqs. 
and (|35|) . For large times, the two types of calculations agree quite well. 
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Figure 4: Plot of probability function <f>(x, t) against the magnitude of x for N = 3. 
The numerical results are shown for times 10, 30, 100, and 1000. In contrast to the 
N — 1 case, these curves contain a central peak which continues to narrow and to 
grow for large t. 
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Figure 5: Scaled plot of probability function versus the magnitude of x for N = 3. 
The scaling is picked to make the central peak show a time independent behavior in 
the new coordinates. The abscissa is xt 1 ^ 2 , so that for larger time the picture focuses 
upon smaller values of x. The ordinate is (j>(x, t)t~ N ^ 2 , and thereby the picture focuses 
upon increasingly concentrated distributions for larger t. Numerical results are shown 
for times 10, 30, 100, and 1000. In these coordinates, the figure shows (as expected) 
an approach to a constant value at large times. 
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Figure 6: Histogram plot of the fraction of matches with inverse scores in a given 
range, as a function of inverse score, for several genes with differing degrees of 
pleiotropy. Significant differences in the statistics of these matches are observed; 
for example, cytochrome P-l-450 has a very large percentage of matches with high 
scores, whereas the matches for rhodopsin tend to have low scores. 
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